Cars4U (Project 2)

Submitted by Sumant Sarkar (Jan'22 Cohort)

Cars4U is a tech start-up that aims to find footholes in the used car sales market. The goal of this project is to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

In this prokect, we will explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business. The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.

  1. S.No.: Serial number
  2. Name: Name of the car which includes brand name and model name
  3. Location: Location in which the car is being sold or is available for purchase (cities)
  4. Year: Manufacturing year of the car
  5. Kilometers_driven: The total kilometers driven in the car by the previous owner(s) in km
  6. Fuel_Type: The type of fuel used by the car (Petrol, Diesel, Electric, CNG, LPG)
  7. Transmission: The type of transmission used by the car (Automatic/Manual)
  8. Owner: Type of ownership
  9. Mileage: The standard mileage offered by the car company in kmpl or km/kg
  10. Engine: The displacement volume of the engine in CC
  11. Power: The maximum power of the engine in bhp
  12. Seats: The number of seats in the car
  13. New_Price: The price of a new car of the same model in INR Lakhs (1 Lakh INR = 100,000 INR)
  14. Price: The price of the used car in INR Lakhs

Observations from the basic pandas profile report

  1. Considering all attributes of all rows, 1.5% cells are blanks. This is not unusually high. So, prima-facie, the dataset can be considered good enough for this case.
  2. S.No. attribute is confirmed to be unique and not expected to play a role in the regression model. So we will drop this.
  3. Name is a high cardinality attribute. It has a lots of unique values. This field cannot be treated as a categorical variable. We will drop it.
  4. Location has 11 distinct values. We should analyze this further. But prima-facie, it appears that we can treat it as a categorical variable and by doing a 1-hot encoding, we can use it as one of the input attributes for the regression model.
  5. Year also has a few distinct values compared to the number of rows (23 vs. 7273). But as it is a numeric, we will use it directly as a regression input attribute. This makes sense intuitively since higher the year, higher is the price expected to be. The data appears to be right skewed with more data points for the recent years, although this trend is not followed for the latest three years.
  6. Kilometers_driven seems to be very skewed. The maximum value of 6.5 million kilometers seems suspicious. We need to look at these outliers closely and see if they should be removed from the data before building the model.
  7. Fuel type has no missing values and the distribution is as can be expected intuitively.
  8. Transmission has no missing values and the split between Manual and Automatic seems okay (although, on second thoughts, Automatic could be over-represented in the data)
  9. Owner_Type attribute has no missing values and the data is skewed towards one owner cars.
  10. Mileage field is String at the moment. Need to clean it up to extract the number and then see the distribution. It is missing in 2 tuples and has 450 distinct values, which is not surprising.
  11. Engine field is String at the moment. Need to clean it up to extract the number and then see the distribution. It is missing in 46 tuples and has 150 distinct values.
  12. Power field is String at the moment. Need to clean it up to extract the number and then see the distribution. It is missing in 175 tuples and has 385 distinct values.
  13. Seats has a minimum of 0 and maximum of 10. Both outliers seem suspicious. Need to look at these tuples. It is missing in 53 tuples and it is centered around 5, as expected.
  14. New_Price is right skewed. Need to look at the ones with extremely high prices like 375 lakhs INR. Could be alright. But need to check against the make/model.
  15. Price field, which represents the current market value as a used car, has an average of 9.4 lakhs which is much lower than the average New_Price, as expected. It is right skewed as well. Minimum seems fine, but the maximum should be double checked. It's missing in 1234 tuples and these have to removed before creating the model since this is the field we are trying to predict the value of.
  16. The positive correlation between (a) Year and Price (b) Year and New_Price (c) New_Price and Price make sense. The negative correlation between Year and Kilometers_driven also makes sense.

Most of the tuples have the mileage in kmpl unit. A few are in km/kg. As explained in the Project FAQs, there is no need to convert from one unit to another. We can use the numbers as they are.

We see that Mileage_num column has the number part of the mileage and it is ready to be used as an input to the regression model.

As we did earlier, let's convert the Engine_num to float.

All tuples have the same unit. This unit will play no role in the regression model. This field can now be dropped.

The missing count is still 46.

Finally, let's preprocess the Power field.

We have completed the preprocessing of these three columns.

All data types look good now. Everything that should be numeric is so.

Univariate Analysis

Observation on Year

Observation on Kilometers_Driven

Observation on Kilometers_Driven < 500K cases

There are still many outliers, but it's not as extremely skewed as earlier.

Observation on Seats

Based on the Name, it looks alright. For these models, 9/10 seats is okay.

This is suspicious. 0 Seater? Also Power is NaN. We will delete this row. Alternately, we could have set this field to blank and then deal with this the way we deal with blanks later. But I think deleting is better as the row may have other issues as well.

Observation on New_Price

Observation on Price

Observation on Power

Observation on Mileage

Observation on Engine

Observations on Location

Observations on Fuel_Type

Observations on Transmission

Observations on Owner_Type

Bivariate Analysis

Analysis

Analysis

Fuel Type vs Transmission vs Price

Fuel Type vs Transmission vs Kilometers_Driven

New_Price vs Year vs Transmission

Price vs Year vs Transmission

New_Price vs Year vs Location

Price vs Year vs Location

New_Price vs Year vs Seats

Price vs Year vs Seats

Comparison of raw values vs. log values

Coefficient Interpretations

Based on raw values of Price and New_Price

  • Coefficients of Year, Bangalore/Chennai/Coimbatore/Hyderabad/Jaipur/Pune locations, Electric/LPG fuel type columns, are strongly positive.
    • Increase in these will lead to a significant increase in the price.
  • Coefficients of New_Price, Engine, Power columns, etc. are slightly positive.
    • Increase in these will lead to a marginal increase in the price.
  • Coefficients of Seats, Kolkata/Mumbai locations, Petrol/Diesel fuel type, Manual transmission columns, are strongly negative.
    • Increase in these will lead to a significant decrease in the price.
  • Coefficients of Kilometers driven, Mileage, Owner count, Delhi/Kochi location columns, are slightly negative.
    • Increase in these will lead to a marginal decrease in the price. (Note: The cutoff of 0.5 was chosen arbitrarily to distinguish strongly positive vs slightly positive.

Based on log values of Price and New_Price

  • Coefficients of Year, Seats, log(New Price), Engine, Power, Bangalore/Chennai/Coimbatore/Hyderabad locations, Diesel/Electric/LPG fuel type columns are positive.
    • Increase in these will lead to an increase in the log(price).
  • Coefficients of kilometers driven, Mileage, Owner count, Delhi/Jaipur/Kochi/Kolkata/Mumbai/Pune locations, Petrol fuel type, Manual transmission columns, are negative.
    • Increase in these will lead to a decrease in the price.